ABSTRACT

The Mantel-Haenszel (MH; 1959) approach of Holland
and Thayer (1988) is a well-established method for assessing
differential item functioning (DIF). The formula for the variance of
the MH DIF statistic is based on work by Phillips and Holland (1987)
and Robins, Breslow, and Greenland (1986). Recent simulation studies
showed that the MH variances tended to be larger when items were
administered to "examinees" who were randomly selected from a
population than when items were administered adaptively. An analytic
perspective shed some light on this puzzling result. Although the
general form of the MH variance is complex and does not provide an
intuitive understanding of the phenomenon, application of certain
Rasch model assumptions yields a simple expression that appears to
explain the difference in variances for adaptive versus nonadaptive
administration. One table, two figures. (Contains 13 references.)

Keywords: Differential item functioning, Mantel-Haenszel, item bias, Rasch model 
The Effect of the Probability of Correct Response on the Variability of Measures of
Differential Item Functioning

Overview 

The Mantel-Haenszel (MH; 1959) approach of Holland and Thayer (1988) is a well-established
method for assessing differential item functioning (DIF). The MH index of DIF, MH D-DIF, is
proportional to the natural log of the MH odds ratio estimate, $\hat{\alpha}_{MH}$. Phillips and
Holland (1987) used a new approach to derive an estimated standard error for
$\ln(\hat{\alpha}_{MH})$; their result proved to be identical to that of Robins, Breslow, and
Greenland (1986).

Recent simulation studies (Zwick, Thayer, & Wingersky, in press; 1994) showed a 
puzzling result: The standard error of MH D-DIF tended to be larger when items were 
administered to "examinees" who were randomly selected from a population than when items were 
administered adaptively; that is, using an algorithm that selects items with difficulty levels close to 
the examinee's ability. 

The general form of the Phillips-Holland variance formula is quite complex and does not 
lend itself to an intuitive understanding of this phenomenon. However, if DIF is assumed to be 
absent and if the item response functions are assumed to follow the Rasch model, the variance 
takes on a simple form that appears to explain the difference in standard errors in adaptive versus 
nonadaptive administration. 

Mantel-Haenszel DIF Analysis 
In the MH approach, examinees are first grouped on the basis of a matching variable that is 
intended to be a measure of ability in the area of interest. In typical DIF applications, the matching 
variable is the total score on the test in which the item under investigation, the studied item, is
embedded. 

The score on the studied item, group membership, and the value of the matching variable
for each examinee define a 2 x 2 x K cross-classification of examinee data, where K is the number
of levels of the matching variable. Assume that there are $T_k$ examinees at the kth level of the
matching variable. Of these, $n_{Fk}$ are in the group of interest, the focal group, and $n_{Rk}$
are in the comparison group, or reference group. Of the $n_{Rk}$ reference group members, assume
that $A_k$ answered the studied item correctly while $B_k$ did not. Similarly, $C_k$ of the
$n_{Fk}$ matched focal group members answered the studied item correctly, whereas $D_k$ did not.

Within the kth level of the matching variable, let the ratio of the odds of answering the item
correctly for the reference group to the corresponding odds for the focal group be defined as
$\alpha_k$, k = 1, 2, ..., K. The MH $\chi^2$ test approximates the uniformly most powerful
unbiased test of the hypothesis,

$$H_0\colon \alpha_k = \alpha = 1, \quad k = 1, 2, \ldots, K \tag{1}$$

versus the alternative,

$$H_1\colon \alpha_k = \alpha, \ \alpha \neq 1, \quad k = 1, 2, \ldots, K \tag{2}$$

(Holland & Thayer, 1988).



The MH measure of DIF is

$$\text{MH D-DIF} = -2.35 \ln(\hat{\alpha}_{MH}), \tag{3}$$

where $\hat{\alpha}_{MH}$ is the Mantel-Haenszel conditional odds-ratio estimator given by

$$\hat{\alpha}_{MH} = \frac{\sum_k A_k D_k / T_k}{\sum_k B_k C_k / T_k}.$$

In equation 3, the transformation of $\hat{\alpha}_{MH}$ places MH D-DIF on the ETS delta scale of
item difficulty (Holland & Thayer, 1985). The effect of the minus sign is to make MH D-DIF negative
when the item is more difficult for members of the focal group than it is for comparable members
of the reference group.
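
As an illustration of equation 3, the following minimal Python sketch computes the MH odds-ratio
estimate and MH D-DIF from a list of 2 x 2 tables, one per level of the matching variable. The
function names and the example counts are hypothetical and are not taken from the studies cited
here; the sketch also assumes that the denominator sum is nonzero.

```python
import math


def mh_odds_ratio(tables):
    """Mantel-Haenszel odds-ratio estimate.

    tables: list of (A, B, C, D) counts, one tuple per matching level k, where
    A, B = reference group correct, incorrect and C, D = focal group correct, incorrect.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t == 0:
            continue  # an empty level contributes nothing to either sum
        num += a * d / t
        den += b * c / t
    return num / den  # assumes den > 0


def mh_d_dif(tables):
    """MH D-DIF on the ETS delta scale (equation 3)."""
    return -2.35 * math.log(mh_odds_ratio(tables))


# Hypothetical counts in which both groups have the same odds at every level.
no_dif_tables = [(20, 30, 4, 6), (40, 20, 8, 4), (45, 5, 9, 1)]
alpha_hat = mh_odds_ratio(no_dif_tables)
d_dif = mh_d_dif(no_dif_tables)

# Hypothetical single level in which the reference group has three times the odds.
dif_table = [(30, 10, 20, 20)]
alpha_hat_dif = mh_odds_ratio(dif_table)
d_dif_dif = mh_d_dif(dif_table)
```

In the second example the reference group has the higher odds of success, so the item is harder
for the focal group and MH D-DIF is negative, in line with the sign convention described above.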

The variance of $\ln(\hat{\alpha}_{MH})$ is given by

$$\mathrm{Var}(\ln(\hat{\alpha}_{MH})) = \frac{1}{2M^2} \sum_{k=1}^{K} T_k^{-2}\, E[(A_k D_k + \alpha B_k C_k)(A_k + D_k + \alpha(B_k + C_k))], \tag{4}$$

where

$$M = E\Big(\sum_{k=1}^{K} A_k D_k / T_k\Big) \tag{5}$$

(see Phillips & Holland, 1987, their equations 4 and 8).¹ The sample estimate is

$$\widehat{\mathrm{Var}}(\ln(\hat{\alpha}_{MH})) = \frac{\sum_k U_k V_k / T_k^2}{2\big(\sum_k A_k D_k / T_k\big)^2}, \tag{6}$$

where $U_k = (A_k D_k) + \hat{\alpha}_{MH}(B_k C_k)$ and
$V_k = (A_k + D_k) + \hat{\alpha}_{MH}(B_k + C_k)$ (Phillips & Holland, 1987, their equation 9).
The estimated standard error for MH D-DIF can then be expressed as

$$SE(\text{MH D-DIF}) = 2.35 \sqrt{\widehat{\mathrm{Var}}(\ln(\hat{\alpha}_{MH}))}$$

(Holland & Thayer, 1988).
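
The sample estimate in equation 6 and the resulting standard error can be sketched in Python as
follows. This is an illustrative implementation under the cell notation used above, not code from
the cited studies; the function names and the example counts are hypothetical.

```python
import math


def mh_variance_estimate(tables):
    """Estimated variance of ln(alpha_hat_MH), following equation 6.

    tables: list of (A, B, C, D) counts per matching level, with
    A, B = reference correct, incorrect and C, D = focal correct, incorrect.
    Assumes every level is nonempty and both odds-ratio sums are positive.
    """
    sum_ad = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    sum_bc = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha_hat = sum_ad / sum_bc

    total = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        u = a * d + alpha_hat * b * c      # U_k
        v = (a + d) + alpha_hat * (b + c)  # V_k
        total += u * v / t ** 2
    return total / (2.0 * sum_ad ** 2)


def se_mh_d_dif(tables):
    """Estimated standard error of MH D-DIF."""
    return 2.35 * math.sqrt(mh_variance_estimate(tables))


# Hypothetical single-level examples.
var_balanced = mh_variance_estimate([(10, 10, 10, 10)])
var_unbalanced = mh_variance_estimate([(30, 10, 20, 20)])
se_balanced = se_mh_d_dif([(10, 10, 10, 10)])
```

With a single level (K = 1), equation 6 reduces algebraically to 1/A + 1/B + 1/C + 1/D, the
familiar large-sample variance of a log odds ratio from one 2 x 2 table, which gives a convenient
check on the implementation.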



Motivation for the Derivation of a Simplified Form for $\mathrm{Var}(\ln(\hat{\alpha}_{MH}))$

In the simulation study of Zwick, Thayer, and Wingersky (in press), item responses were
generated using a 3PL model. In the main portion of the study, which involved computer-adaptive
tests (CATs), 25 items out of a pool of 75 were "administered" to each examinee using an
algorithm that selected the most informative item at the examinee's current estimate of ability.
Ability was reestimated following each item response. A separate portion of the study investigated
DIF results for items that were administered nonadaptively. A comparison showed that
SE(MH D-DIF) tended to be larger for nonadaptive than for CAT items, although the sets of items
had similar generating parameters and DIF properties. For example, for $n_R = 900$ reference group
members and $n_F = 100$ focal group members, SE(MH D-DIF) had a range of 0.6 to 1.1 for the
nonadaptive items, compared to 0.5 to 0.7 for the CAT items.

To explore this issue further, both nonadaptive and CAT DIF results were obtained for the
items in the CAT pool. Figures 1 and 2 (from Zwick, Thayer, & Wingersky, 1994) show these
MH D-DIF statistics and their standard errors, respectively. (There were 71, rather than 75, items
because four items were never administered in the simulated CAT.) The values plotted along the
horizontal axis are based on nonadaptive administration to $n_R = 900$ reference group members and
$n_F = 100$ focal group members. The values plotted along the vertical axis are based on only the
examinees who received the item in a CAT administration. The method used to estimate the
CAT-based MH D-DIF statistics and their standard errors for $n_R = 900$, $n_F = 100$ is described
in Zwick, Thayer, and Wingersky (in press). The same matching variable, an item response theory
(IRT)-based expected true score, was used for the CAT and nonadaptive DIF statistics.



Insert Figures 1-2 about here. 



Figure 1 shows that the MH D-DIF statistics are clustered around the 45-degree line; there 
were no systematic differences for the two types of administration. The standard errors, however,
were substantially different, with the nonadaptive standard errors exceeding the CAT standard
errors by an average of 15 percent. Figure 2 shows that the nonadaptive standard errors were 
larger for almost every item. 

To determine whether these findings were related to the particular CAT algorithm used in
this study, another analysis was conducted, comparing nonadaptive administration to a
"pseudo-CAT" administration, in which examinees were eliminated from the DIF analysis of a
particular item if their abilities departed substantially from the estimated item difficulty.
Results were very similar to those obtained in Figures 1 and 2, suggesting that this phenomenon
was associated with the ability range of the examinees, but was not unique to the implemented CAT
algorithm. A later study by Way (1994), which used a different CAT algorithm, revealed the same
phenomenon.



MH D-DIF and its Standard Error under the Rasch Model

In the Rasch model, the probability of correct response in group G can be expressed as

$$P_G(X = 1 \mid \theta) = \frac{\exp(\theta - b_G)}{1 + \exp(\theta - b_G)}, \tag{7}$$

where X is the score on the item, X = (0, 1), with "1" indicating a correct response, $\theta$
represents ability, and $b_G$ is the item difficulty in group G (G = R, F). Although a model that
ignores guessing and treats items as equally discriminating cannot be expected to hold for typical
multiple-choice tests, the Rasch model often proves useful for explanatory purposes. For example,
Holland and Thayer (1988) offered a Rasch-based analysis that elucidates the relation between
MH D-DIF and the item difficulty parameters for the reference and focal groups. From an IRT
perspective, DIF can be defined as a difference in item response functions (IRFs) for two groups
(Lord, 1980). Holland and Thayer (1988) showed that under certain conditions, identity of item
response functions across groups for the studied item satisfies the MH null hypothesis
(equation 1) and a difference in IRFs across groups corresponds to the MH alternative hypothesis
(equation 2). The required conditions are:

(i) within each of the groups (reference and focal), the IRFs follow the Rasch model in (7),

(ii) the matching variable is the number-right score based on all items, including the studied item,
and

(iii) the items have the same IRFs for the two groups, with the possible exception of the studied
item.

Under these conditions, the odds ratios $\alpha_k$ in (1) and (2) are equal to $\exp(b_F - b_R)$,
where $b_R$ and $b_F$ are the item difficulties for the reference and focal groups, respectively.
The quantity $\exp(b_F - b_R)$ is constant across all levels of the matching variable and is equal
to one when the reference and focal groups have the same IRF. Zwick (1990) showed that the
correspondence between IRF and MH definitions of DIF could not be assured to hold for a more
general class of item response models that includes the usual two- and three-parameter logistic
(2PL and 3PL) models.² Nevertheless, in simulation studies, this result has been found to hold
approximately under moderate departures from (7). Because the Rasch model has provided useful
insights about the behavior of the MH D-DIF statistic, even when the true model is more complex,
it seemed worthwhile to consider a Rasch-based analysis of SE(MH D-DIF) in attempting to explain
the surprising finding about MH standard errors.
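
The constancy of the odds ratio under the Rasch model can be checked numerically with a short
Python sketch. Note the simplification: the sketch conditions on ability rather than on the
number-right score used in the Holland and Thayer (1988) result, and it assumes the usual logistic
form of the Rasch model. The difficulty values and function names are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical difficulties: the item is harder for the focal group.
b_ref, b_foc = 0.0, 0.5

# Ratio of reference-group odds to focal-group odds at several ability levels.
thetas = (-2.0, -1.0, 0.0, 1.0, 2.0)
ratios = [odds(rasch_prob(th, b_ref)) / odds(rasch_prob(th, b_foc)) for th in thetas]
expected = math.exp(b_foc - b_ref)
```

Every entry of `ratios` agrees with `exp(b_foc - b_ref)` up to rounding error, regardless of the
ability level at which it is evaluated.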

The result of applying Rasch model and no-DIF assumptions to the Phillips and Holland
(1987) variance in equation 4 (above) will now be demonstrated. As a first step, the expected table
frequencies will be obtained under the following simplifying assumptions:

(1) The matching variable is S, the total test score, including the studied item.

(2) IRFs for the two groups can be represented by a single Rasch model (as in 7). This further
implies that

(2a) Conditional independence holds, i.e.,
$P(\mathbf{X} = \mathbf{x} \mid \theta) = \prod_i P(X_i = x_i \mid \theta)$ for all $\theta$,
where i indexes items.

(2b) There is no DIF; that is, $P_R(X = 1 \mid \theta) = P_F(X = 1 \mid \theta)$ holds for all
items.

Assumption 2b may seem unduly restrictive, but, in fact, simulation results show that the size of
the MH standard errors depends very little on the true magnitude of DIF (e.g., Zwick, Thayer, &
Wingersky, in press; 1994).

To obtain the expected table frequencies under these assumptions, a useful property of the 
Rasch model can be exploited: Under this model, S is sufficient for $\theta$ and therefore,
$P_G(X = 1 \mid S, \theta) = P_G(X = 1 \mid S)$ (Zwick, 1990). Invoking this property, as well as
the above assumptions, it can be shown that $P_R(X = 1 \mid S) = P_F(X = 1 \mid S)$. Note that
identity of item response functions for the reference and focal groups (assumption 2b) does not,
in general, imply that $P_R(X = 1 \mid S) = P_F(X = 1 \mid S)$. For example, this implication
would not hold for the usual 2PL and 3PL models.
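
The sufficiency property can be illustrated by brute-force enumeration for a very short test. The
Python sketch below computes the probability of a correct response on a studied item given the
total score and ability, and evaluates it at several ability values. The five difficulty values,
the choice of studied item, and the function names are hypothetical.

```python
import itertools
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def prob_correct_given_score(theta, difficulties, studied, score):
    """P(X_studied = 1 | S = score, theta), summing over all response patterns."""
    num = 0.0
    den = 0.0
    for pattern in itertools.product((0, 1), repeat=len(difficulties)):
        if sum(pattern) != score:
            continue
        # Probability of this pattern under conditional independence (assumption 2a).
        p = 1.0
        for x, b in zip(pattern, difficulties):
            p_item = rasch_prob(theta, b)
            p *= p_item if x == 1 else 1.0 - p_item
        den += p
        if pattern[studied] == 1:
            num += p
    return num / den


difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item difficulties
values = [
    prob_correct_given_score(theta, difficulties, studied=2, score=3)
    for theta in (-2.0, 0.0, 2.0)
]
spread = max(values) - min(values)
```

The three conditional probabilities coincide up to rounding error, so once the total score is
known, ability carries no further information about the response to the studied item.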



The expected frequencies in stratum k of the 2 x 2 x K table are as shown in Table 1, where
$\pi_k = P_R(X = 1 \mid S = s_k) = P_F(X = 1 \mid S = s_k)$ is the probability of a correct
response in stratum k, and $n_{Rk}$ and $n_{Fk}$ are the reference and focal group frequencies,
respectively, in stratum k.



Insert Table 1 about here 



In deriving their variance formula, Phillips and Holland (1987, equation 4 above) assumed
that $A_k$ and $C_k$ follow the independent binomial distributions,
$A_k \sim B(n_{Rk}, \pi_k)$ and $C_k \sim B(n_{Fk}, \pi_k)$. (Alternatively, it can be assumed
that $A_k$ is a hypergeometric variate.) In the present context, the assumption of binomial
distributions is redundant, since the Rasch model assumptions imply that, conditional on $T_k$,
$A_k \sim B(n_{Rk}, \pi_k)$ and $C_k \sim B(n_{Fk}, \pi_k)$ (Rasch, 1960, p. 180).

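Now, because Assumption 2b implies that $\alpha = 1$, and because
$A_k + B_k + C_k + D_k = T_k$, equation 4 can be simplified as follows:

$$\mathrm{Var}(\ln(\hat{\alpha}_{MH})) = \frac{1}{2M^2} \sum_k T_k^{-1}\, E[A_k D_k + B_k C_k],$$

where M is given by equation 5. Invoking the independence of the two binomials yields

$$\mathrm{Var}(\ln(\hat{\alpha}_{MH})) = \frac{1}{2M^2} \sum_k T_k^{-1}\, [E(A_k)E(D_k) + E(B_k)E(C_k)],$$

where M can now be expressed as $M = \sum_k \frac{1}{T_k} E(A_k)E(D_k)$.

The expected cell frequencies from Table 1 can now be substituted. Because
$E(A_k)E(D_k) = E(B_k)E(C_k) = n_{Rk} n_{Fk} \pi_k (1 - \pi_k)$, the sum equals 2M, leading to the
result

$$\mathrm{Var}(\ln(\hat{\alpha}_{MH})) = M^{-1} = \Big[\sum_k \frac{n_{Rk}\, n_{Fk}\, \pi_k (1 - \pi_k)}{T_k}\Big]^{-1}. \tag{8}$$

The expression in (8) is a theoretical formulation of the variance of $\ln(\hat{\alpha}_{MH})$
under the stated Rasch assumptions.³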
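
The agreement between the estimated variance formula of equation 6, evaluated at expected cell
counts, and the simplified form in (8) can be checked numerically. The Python sketch below does so
for three hypothetical strata; the stratum sizes, the probabilities of correct response, and the
function names are illustrative assumptions.

```python
def expected_table(n_ref, n_foc, pi):
    """Expected (A, B, C, D) counts in one stratum under the no-DIF Rasch assumptions."""
    return (n_ref * pi, n_ref * (1.0 - pi), n_foc * pi, n_foc * (1.0 - pi))


def variance_eq6(tables):
    """Equation 6 applied to a list of (A, B, C, D) tables."""
    sum_ad = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    sum_bc = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha_hat = sum_ad / sum_bc
    total = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        u = a * d + alpha_hat * b * c
        v = (a + d) + alpha_hat * (b + c)
        total += u * v / t ** 2
    return total / (2.0 * sum_ad ** 2)


def variance_eq8(strata):
    """Equation 8 for a list of (n_Rk, n_Fk, pi_k) strata, with T_k = n_Rk + n_Fk."""
    m = sum(n_r * n_f * pi * (1.0 - pi) / (n_r + n_f) for n_r, n_f, pi in strata)
    return 1.0 / m


# Hypothetical strata: (reference count, focal count, probability correct).
strata = [(300, 30, 0.3), (400, 50, 0.5), (200, 20, 0.8)]
tables = [expected_table(*stratum) for stratum in strata]

v6 = variance_eq6(tables)
v8 = variance_eq8(strata)
```

For these inputs the two values agree up to rounding error, as the derivation implies.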

Note the obvious similarity of the result in (8) to the asymptotic variance of the logit of a
sample proportion, p, which is given by

$$\mathrm{Var}(\mathrm{logit}(p)) = \mathrm{Var}(\ln(p/(1 - p))) = (n\pi(1 - \pi))^{-1}, \tag{9}$$

where $\pi$ is the population proportion (Agresti, 1990).⁴ When $\pi = .5$, the variance of
logit(p) is minimized. Similarly, for fixed values of K, $n_{Rk}$, and $n_{Fk}$, the variance in
(8) is minimized if the probability of a correct response in each stratum is equal to .5. The
condition in which $\pi_k = .5$ for all k is consistent with CAT administration, which provides
for items to be assigned to examinees of an appropriate, and thus rather narrow, ability range.
Large departures from this condition occur for nonadaptive items. These findings appear to shed
light on the differences in the MH standard errors for adaptive and nonadaptive administration.
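
The effect of the stratum probabilities on the variance in (8) can be seen in a small numerical
comparison. In the Python sketch below, the group sizes are held fixed and only the probabilities
of correct response change; all numbers and function names are hypothetical and are not intended
to reproduce the simulation results discussed earlier.

```python
def variance_eq8(strata):
    """Equation 8 for a list of (n_Rk, n_Fk, pi_k) strata, with T_k = n_Rk + n_Fk."""
    m = sum(n_r * n_f * pi * (1.0 - pi) / (n_r + n_f) for n_r, n_f, pi in strata)
    return 1.0 / m


def logit_variance(n, pi):
    """Asymptotic variance of the logit of a sample proportion based on n observations."""
    return 1.0 / (n * pi * (1.0 - pi))


# Hypothetical stratum sizes (reference, focal), identical in both conditions.
sizes = [(300, 30), (400, 50), (200, 20)]

# "CAT-like" condition: probability of a correct response is .5 in every stratum.
var_cat_like = variance_eq8([(r, f, 0.5) for r, f in sizes])

# "Nonadaptive-like" condition: probabilities spread away from .5.
spread_pis = (0.2, 0.5, 0.9)
var_spread = variance_eq8([(r, f, p) for (r, f), p in zip(sizes, spread_pis)])

# Ratio of the corresponding standard errors (the 2.35 scaling factor cancels).
se_ratio = (var_spread / var_cat_like) ** 0.5
```

Because each stratum contributes in proportion to $\pi_k(1 - \pi_k)$, moving any probability away
from .5 shrinks M and enlarges the variance, so `se_ratio` exceeds one here.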

Two subtle aspects of these results are worthy of note. First, comparison of the MH 
variances for adaptive and nonadaptive items is somewhat more complicated than the exposition
above suggests. The variances for these two types of items differ not only in terms of the values
of $\pi_k$, but in terms of the values of $n_{Rk}$ and $n_{Fk}$ and possibly K, the number of
strata. In an adaptive administration, examinees are typically concentrated in a smaller number of
test score levels. Second, whereas the estimated MH variances are related to the item proportion
correct (the classical item difficulty), they do not have a straightforward relationship to the
IRT difficulty parameter, b. For a given value of b, the proportion correct for an adaptive item
will tend to be
closer to .5 than the proportion correct for a nonadaptive item. Therefore, MH standard errors will 
tend to be larger for a nonadaptive item than for an adaptive item with the same b value. 
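
The second point can be illustrated with a deliberately simplified Python sketch that compares the
proportion correct on one Rasch item for two sets of examinees. As assumptions made only for this
illustration, nonadaptive examinees are spread evenly over abilities from -3 to 3, adaptive
examinees lie within 0.5 of the item difficulty, and the item has b = 1.5; none of these values
comes from the studies cited above.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def grid(lo, hi, n):
    """n + 1 evenly spaced ability values from lo to hi."""
    return [lo + i * (hi - lo) / n for i in range(n + 1)]


def proportion_correct(thetas, b):
    """Average probability of a correct response over a set of abilities."""
    return sum(rasch_prob(t, b) for t in thetas) / len(thetas)


b = 1.5  # hypothetical item difficulty

nonadaptive_thetas = grid(-3.0, 3.0, 600)     # wide ability range (assumed)
adaptive_thetas = grid(b - 0.5, b + 0.5, 100)  # abilities close to b (assumed)

pc_nonadaptive = proportion_correct(nonadaptive_thetas, b)
pc_adaptive = proportion_correct(adaptive_thetas, b)
```

With the same b value, the adaptive proportion correct sits at .5 because the abilities are
symmetric about b, whereas the nonadaptive proportion correct falls well below .5 for this
relatively hard item.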
Footnotes

¹ The variance of $\ln(\hat{\alpha}_{MH})$ is infinite since $\ln(\hat{\alpha}_{MH})$ can take on
the values $\pm\infty$ with positive probability. Strictly speaking, the Phillips-Holland variance
formula applies not to $\ln(\hat{\alpha}_{MH})$, but to a Taylor series approximation to
$\ln(\hat{\alpha}_{MH})$ that does have a finite variance. This qualification applies to all
references to $\mathrm{Var}(\ln(\hat{\alpha}_{MH}))$ in this paper.

² Fischer (1993) showed that the definitions of DIF agree for very restrictive cases of the
2PL. Among other requirements, it is necessary that the discrimination parameters be known and
that they be the same for the reference and focal groups on each item.

³ Although this need not have been the case, substitution of the expected cell counts of
Table 1 for their sample counterparts in the estimated variance formula of equation 6 produces the
same result as (8). Without the more rigorous derivation, however, it would be difficult to know
how to interpret the obtained expression.

⁴ This variance formula is again based on a Taylor series approximation; see footnote 1.



